Plan-Per-Tuple Optimization Solution - Parallel Execution of Expensive User-Defined Functions
نویسندگان
چکیده
Object-Relational database systems allow users to define new user-defined types and functions. This presents new optimizer and run-time challenges to the database system on shared-nothing architectures. In this paper, we describe a new strategy we are exploring for the NCR Teradata Multimedia Database System; our focus is directing research for real applications we are seeing. In doing so, we will briefly describe optimizer challenges particularly related to predicate use of large multimedia objects, such as video/audio clips, images, and text documents. The motivation for this work is based on database tuning [SD961 for diverse queries related to multimedia objects. Most notably, expensive and/or high variant user defined functions [He198]. Our approach is referred to as plan-per-tuple. The primary focus being on large objects used as predicate-based terms when a non co-located join is involved in the query. But can also be applicable in non co-located join scenarios also. The execution engine can choose from among N! resource optimization strategies; where N represents system manageable resources. In our case, the N resources are: (i) interconnect saturation levels, (ii) available physical memory, (iii) CPU utilization, and (iv) available disk spool space percentages. However, this technique can be applied to any system resources being managed. The optimizer search space does not include these N! resource optimization strategies per’se, these are execution engine run-time optimization strategies. When the optimizer identifies expensive, or more importantly a high variant, userdefined function in the predicate (via collected statistics), then the optimizer can generate plans that incorporate plan-per-tuple optimization for that particular compiled query. When executing the plan, a different execution strategy can be used per tuple; the available execution choices do not necessarily equal N! We describe when such an overhead for run-time selection is acceptable. 690 1.0 Introduction With the enhancement of SQL3’s typing system, users are allowed to define new user-defined types and functions. In fact, many of our initial customer installations are employing large objects (out-ofline tuple attributes) into their applications; this includes large objects used as predicate-based terms [CS98]. This presents new optimizer and runtime challenges to the system on a shared-nothing MPP-based architectures. In this paper, we outline the strategy that we have been exploring in the NCR Teradata Multimedia Database System (MDBS) in which development started in 1993 [CS98, Cat%, CSI94,OIS+96]. The M-DBS Release 1 was made generally available in 1997. This optimization approach was not incorporated into this release. The system’s fundamental design is based on NCR Teradata Database System (DBS) version 2 (V2). For in-depth discussions on the DBS, see [CK92, CSK95, WCP93]. M-DBS extends the DBS with SQL3 capabilities [Sto96]. A comprehensive analysis of the M-DBS architecture, key new concepts, and user-defined type (UDT) and user-defined function (UDF) optimizer issues can be found in [CSI94, CS98]. Many previous papers analyzed diverse optimization strategies, which include: costbased [SAC+79], rule-based [Fre87], extensible [PHH92], libraries [MBH+96], multimedia [SG96], expensive UDFs [He198], and UDF predicate re-writes [CS96]. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/ or special permission from the Endowment, Proceedings‘ of the 24th VLDB Conference New York, USA, 1998 This paper focuses on overviewing dynamic execution strategies [CG94] based on optimizer statistical decisions; most notably, predicate terms which are most likely defined as large object (LOBS) columns. A LOB is any column that does not fit in-line in the tuple. We will additionally highlight some of the complex UDF issues that arise in shared-nothing architectures; in particular, for non co-located joins when LOBS are in the predicate; this implies LOBS may be moved. As previously noted, we are concentrating on a runtime execution strategy based on optimizer statistics, not optimizer search strategies used when compiling the query. For a comprehensive discussion on optimizer search strategies with expensive predicate terms, see [He198]. The optimizer does not know, nor can it, the complete state of the system when the plan is executed while compiling the query. Moreover, compiled plans are many times cached for latter execution. Additionally, Teradata DBS installations commonly have hundreds to thousands of sessions currently connected to the DBS running typically tens of queries at anyone instant. Flow and load control management as well as systems resource management is critical; system resource must be utilized at lOO%, but no more. Modifying the behavior of a compiled plan based on the system state when it is executed can assist the execution engine in (i) keeping all critical resources at or near peak capacity, and (ii) controlling thrashing through data flow and resource management. As a result, our plan-pertuple solution addresses the presence of variance, hot spots and diverse workloads on the system. In this case, variance is measured in main memory, interconnect saturation, CPU, and disk spool space utilization. The paper is organized as follows: Section 2 briefly overviews the DBS V2 optimizer. Section 3 overviews the M-DBS optimization issues. Section 4 contains a brief description of our plan-per-tuple run-time optimization strategy. Finally, concluding remarks are in section 5. 2.0 Teradata Database V2 Optimizer Overview We will first briefly overview a few of the attributes of the DBS V2 optimizer. A detailed discussion is beyond the scope and purpose of this paper. The DBS has a mature cost-based optimizer for handling SQL-92 queries; rules and re-writes are built-in and cost based. It handles simple stored procedures (persistent stored modules), but does not handle UDFs. Its optimization strategy is based on system throughput. In doing so, it considers three system resources while evaluating all execution strategies at compile time’, which are CPU, disk usage, interconnection network usages, respectively. It relies on “.selectivizy” estimates when invoking its internal formulas; statistics are the key. There are several other important factors such as indexing, table demographics and other statistical information that is kept or generated at query compile-time. Memory is not directly part of the optimizer formulas (it’s correlated to other resources); at compile time it is difficult to predict what memory usage will be when a query is executed and what the memory contention will be during an execution. Our model attempts to address this by using memory utilization considerations at run-time, not only at compile or scheduling time. If the table has no statistics, the optimizer will generate table demographics by random sampling of the table data blocks. If the user has collected statistics on the involved tables, the join plan will always be consistent. The cost model used minimizes the total resource utilization, assuming exclusive use of the system. The optimizer minimizes the sum of the CPU utilization, the disk array utilization (calculated using maximum disk array throughput), and BYNET interconnect utilization. The optimizer also determines whether the use of an index is more efficient than a full file scan over the data. But, all this is compile time considerations, it can not consider what other queries may be running on the system when this query is executed. We believe that under certain UDF invocation classifications (via statistics), runtime considerations should be taken. Especially when LOBS are used as a joining or filtering term in a non co-located join operation. ’ Most poor strategies are pruned very quickly.
منابع مشابه
Optimization of Queries with User-deened Predicates
Relational databases provide the ability to store user-de ned functions and predicates which can be invoked in SQL queries. When evaluation of a user-de ned predicate is relatively expensive, the traditional method of evaluating predicates as early as possible is no longer a sound heuristic. There are two previous approaches for optimizing such queries. However, none of these approaches is able...
متن کاملIvanova Scalable Scientific Stream Query Processing
Ivanova, M. 2005. Scalable Scientific Stream Query Processing. Acta Universitatis Upsaliensis. Uppsala Dissertations from the Faculty of Science and Technology 66. 137 pp. Uppsala. ISBN 91-554-6351-7 Scientific applications require processing of high-volume on-line streams of numerical data from instruments and simulations. In order to extract information and detect interesting patterns in thes...
متن کاملJaCaL: AN IMPLEMENTATION OF LINDA IN JAVA
Java is an object-oriented programming language with built-in features for creating distributed programs. A key feature-set that is missing, however, is 'an easy-to-use, reliable, and scaleable tool for writing truly parallel programs. The Linda para\lel programming model defmes a client-server approach where concurrent execution requests are serviced and results are stored to a shared data rep...
متن کاملPlan Reordering and Parallel Execution - A Parameterized Complexity View
We reconsider the optimization problems for partial-order and parallel plans previously studied by Bäckström (JAIR, 1998), but use parameterized complexity to get a more fine-grained picture. Problems we study include the following: I Can we deorder/reorder the actions of a plan optimally? I What is the optimal parallel execution length of a plan? Plan reordering: to change the order of actions...
متن کاملA Framework for Parallel Processing of Aggregate and Scalar Functions in Object - Relational DBMS 1
Nowadays parallel object-relational DBMS are envisioned as the next great wave but there is still a lack of efficient implementation concepts for some parts of the proposed functionality. One of the current goals for parallel object-relational DBMS is to move towards higher performance. In our view the main potential for performance increases lies in providing additional optimization and execut...
متن کامل